[ENH] Put both token id and token str in the statistics #5777
Conversation
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving:
Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
Add string label alongside index

Adds optional string labels to sparse-vector statistics so that each token is emitted in both numeric and string form.

Key Changes
• Changed …

Affected Areas
• …

This summary was automatically generated by @propel-code-bot
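As a rough sketch of the data shape this PR works toward (field names are taken from the diffs below; the actual definitions in `rust/types/src/metadata.rs` may differ):

```rust
// Sketch only: a sparse vector whose entries may optionally carry the original
// token strings, so statistics can report both the numeric index and the token.
pub struct SparseVector {
    pub indices: Vec<u32>,
    pub values: Vec<f32>,
    pub tokens: Option<Vec<String>>, // assumed: one optional string label per index
}
```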
Force-pushed: a2ec476 to a0b964a
tanujnay112 left a comment
Naming should be improved
Force-pushed: beb0dbc to 1279718
rust/types/src/metadata.rs (Outdated)
```diff
@@ -156,7 +113,7 @@ impl SparseVector {
     }
 }
```
[BestPractice]
The implementation of `from_pairs` can be simplified by using `unzip`, which is more idiomatic and likely more efficient. I've also renamed the parameter from `triples` to `pairs` for clarity.
```suggestion
pub fn from_pairs(pairs: impl IntoIterator<Item = (u32, f32)>) -> Self {
    let (indices, values) = pairs.into_iter().unzip();
    Self {
        indices,
        values,
        tokens: None,
    }
}
```
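As a usage illustration of the suggested constructor (a standalone sketch; only the fields visible in the suggestion are assumed):

```rust
let sv = SparseVector::from_pairs([(3_u32, 0.5_f32), (7, 1.25)]);
assert_eq!(sv.indices, vec![3, 7]);
assert_eq!(sv.values, vec![0.5, 1.25]);
assert!(sv.tokens.is_none());
```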
File: rust/types/src/metadata.rs
Line: 114
Force-pushed: 7865879 to ce0a380
```rust
if let Some(stable_value_token) = stats_value.stable_value_token() {
    metadata.insert(
        "value_token".to_string(),
        UpdateMetadataValue::Str(stable_value_token),
    );
}
```
[TestCoverage]
This new logic to add `value_token` is a great addition. However, it doesn't appear to be covered by tests. The existing tests for sparse vectors seem to only cover cases where tokens are not provided, exercising the `None` path for `stable_value_token`.
To ensure this new functionality is robust, could you please add a test case with a `SparseVector` that includes tokens? This test should assert that the `value_token` field is correctly populated in the resulting statistics records. This would likely require updating test helpers like `extract_metadata_tuple` as well.
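A minimal sketch of such a test, assuming hypothetical helpers (`compute_statistics`, `metadata_with_sparse_vector`) and record accessors rather than the crate's real test utilities:

```rust
#[test]
fn sparse_vector_statistics_populate_value_token() {
    // Construct a sparse vector that carries string tokens alongside numeric indices.
    let sv = SparseVector {
        indices: vec![3, 7],
        values: vec![0.5, 1.25],
        tokens: Some(vec!["foo".to_string(), "bar".to_string()]),
    };

    // Run the statistics function under test and inspect the emitted metadata.
    let records = compute_statistics(vec![metadata_with_sparse_vector("embedding", sv)]);
    for record in &records {
        let metadata = record.metadata.as_ref().expect("statistics metadata");
        assert!(
            matches!(metadata.get("value_token"), Some(UpdateMetadataValue::Str(_))),
            "value_token should be set when tokens are provided"
        );
    }
}
```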
File: rust/worker/src/execution/functions/statistics.rs
Line: 255
Force-pushed: ce0a380 to 36086d2
```rust
// Wrap in Arc to avoid cloning large MaterializeLogOutput data
let log_fetch_records_clone = log_fetch_records.clone();
```
[BestPractice]
**Resource leak in concurrent compaction handling**: The `clone()` operations at lines 603-604 create deep copies of `log_fetch_records` and the entire `CompactionContext` for parallel execution. With large datasets (e.g., 10k records), this duplicates significant memory without cleanup guarantees if one future fails.
```rust
// Current: Clones entire state
let log_fetch_records_clone = log_fetch_records.clone();
let mut self_clone_fn = self.clone();

// Safer approach: Use Arc to share data
let log_fetch_records = Arc::new(log_fetch_records);
let fn_records = Arc::clone(&log_fetch_records);
```
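As a standalone illustration of the `Arc`-sharing pattern (a toy sketch assuming a tokio runtime; the record type and task bodies are placeholders, not the orchestrator's real types):

```rust
use std::sync::Arc;

#[tokio::main]
async fn main() {
    // Placeholder for the large materialized-log data; only one copy is allocated.
    let log_fetch_records = Arc::new(vec![1_u32, 2, 3]);

    // Each concurrent task receives a cheap pointer clone instead of a deep copy.
    let fn_records = Arc::clone(&log_fetch_records);
    let stats_task = tokio::spawn(async move {
        fn_records.len() // stand-in for the statistics function's work
    });

    let compact_records = Arc::clone(&log_fetch_records);
    let compact_task = tokio::spawn(async move {
        compact_records.iter().copied().max() // stand-in for compaction work
    });

    let (count, max) = (stats_task.await.unwrap(), compact_task.await.unwrap());
    println!("records seen: {count}, max value: {max:?}");
}
```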
File: rust/worker/src/execution/orchestration/compact.rs
Line: 604

```rust
    .run_compaction(collection_id, system)
    .await;
let result = Box::pin(compaction_context.run_compaction(collection_id, system)).await;
Box::pin(compaction_context.cleanup()).await;
```
[BestPractice]
**Potential panic from unbounded recursion**: `Box::pin()` at line 723 wraps an async function that can recursively call itself through orchestrator chains. With deep call stacks (e.g., multiple attached functions), this risks stack overflow despite the 8MB thread stack allocation in tests.
```rust
// Add depth tracking to prevent unbounded recursion
pub async fn compact(
    // ... params
    max_depth: Option<u32>,
) -> Result<CompactionResponse, CompactionError> {
    if let Some(depth) = max_depth {
        if depth == 0 {
            return Err(CompactionError::InvariantViolation(
                "Maximum orchestration depth exceeded".to_string()
            ));
        }
    }
    // ... rest of function
}
```
File: rust/worker/src/execution/orchestration/compact.rs
Line: 723

```rust
for hnsw_index_uuid in self.hnsw_index_uuids {
    let _ = HnswIndexProvider::purge_one_id(
        self.hnsw_provider.temporary_storage_path.as_path(),
        hnsw_index_uuid,
    )
    .await;
```
[BestPractice]
**Missing error propagation in cleanup path**: `cleanup()` at line 666 uses `let _ =` to ignore HNSW purge errors. If cleanup fails during error handling (e.g., disk full), subsequent compactions will leak resources without visibility.
```rust
pub(crate) async fn cleanup(self) {
    let mut cleanup_errors = Vec::new();
    for hnsw_index_uuid in self.hnsw_index_uuids {
        if let Err(e) = HnswIndexProvider::purge_one_id(
            self.hnsw_provider.temporary_storage_path.as_path(),
            hnsw_index_uuid,
        ).await {
            cleanup_errors.push((hnsw_index_uuid, e));
        }
    }
    if !cleanup_errors.is_empty() {
        tracing::warn!("Cleanup failures: {:?}", cleanup_errors);
    }
}
```
File: rust/worker/src/execution/orchestration/compact.rs
Line: 671

```rust
(Self::SparseVector(lhs1, lhs2), Self::SparseVector(rhs1, rhs2)) => {
    lhs1 == rhs1 && lhs2 == rhs2
}
```
[BestPractice]
For consistency with the other match arms in this `PartialEq` implementation, you can remove the braces and use a single expression with a trailing comma.
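A sketch of the arm in that style (assuming the surrounding match shown in the diff above):

```rust
(Self::SparseVector(lhs1, lhs2), Self::SparseVector(rhs1, rhs2)) => lhs1 == rhs1 && lhs2 == rhs2,
```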
File: rust/worker/src/execution/functions/statistics.rs
Line: 175

```rust
StatisticsValue::Float(value) => value.to_bits().hash(state),
StatisticsValue::Str(value) => value.hash(state),
StatisticsValue::SparseVector(value) => value.hash(state),
StatisticsValue::SparseVector(value, token) => {
```
`token` -> `label`
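A sketch of what the renamed arm might look like (the hash body is an assumption; only the rename from `token` to `label` is being requested):

```rust
StatisticsValue::SparseVector(value, label) => {
    value.hash(state);
    label.hash(state);
}
```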
Description of changes
We need to put tokens into the outputs of the statistics functions in both numeric (token id) and string (token str) form.
Test plan
CI
Migration plan
N/A
Observability plan
N/A
Documentation Changes
N/A